Hacemos analisis factorial para reducir las variables en otras variables resumen. Mientras la clusterización agrupaba filas, la factorización agrupa columnas. Pero, al igual que en clusterización, queremos saber si las nuevas variables tienen un nombre, al cual se le denomina técnicamente variable latente. En esta sesión exploraremos la data a ver qué emerge.
Para esta sesión trabajaremos con la data de estos links:
library(htmltab)
# links
happyL=c("https://en.wikipedia.org/wiki/World_Happiness_Report",
'//*[@id="mw-content-text"]/div/table/tbody')
demoL=c("https://en.wikipedia.org/wiki/Democracy_Index",
'//*[@id="mw-content-text"]/div/table[2]/tbody')
# carga
happy = htmltab(doc = happyL[1],which = happyL[2],encoding = "UTF-8")
demo = htmltab(doc = demoL[1], which = demoL[2], encoding = "UTF-8")
# limpieza
happy[,]=lapply(happy[,], trimws,whitespace = "[\\h\\v]") # no blanks
demo[,]=lapply(demo[,], trimws,whitespace = "[\\h\\v]") # no blanks
library(stringr) # nombres simples
names(happy)=str_split(names(happy)," ",simplify = T)[,1]
names(demo)=str_split(names(demo)," ",simplify = T)[,1]
## Formateo
# Eliminemos columnas que no usaremos esta vez:
happy$Overall=NULL
demo[,c(1,9,10)]=NULL
# También debemos tener nombres diferentes en los scores antes del merge:
names(happy)[names(happy)=="Score"]="ScoreHappy"
names(demo)[names(demo)=="Score"]="ScoreDemo"
# Tipo de variables:
## En demo:
demo[,-c(1)]=lapply(demo[,-c(1)],as.numeric)
# En happy:
happy[,-c(1)]=lapply(happy[,-c(1)],as.numeric)
# sin perdidos:
happy=na.omit(happy)
demo=na.omit(demo)
Presta atención al merge. Usualmente hacemos merge por default y luego perdemos varias filas:
nrow(merge(happy,demo))
## [1] 145
Hagamos un nuevo merge, donde nos quedemos con TODOS los paises que no estab en uno u otro data frame:
HappyDemo=merge(happy,demo,all.x=T, all.y=T)
Esta vez HappyDemo tiene varios paises de más, pero con valores perdidos y nombres que no pudieron coincidir. Veamos:
# formateando a
# HappyDemo[!complete.cases(HappyDemo),]
library(knitr)
library(kableExtra)
kable(HappyDemo[!complete.cases(HappyDemo),],type='html')%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
font_size = 10)
| Country | ScoreHappy | GDP | Social | Healthy | Freedom | Generosity | Perceptions | ScoreDemo | ElecÂtoral | FunctioÂning | PoliÂticalparticiÂpation | PoliÂticalculture | CivilliberÂties | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | Belize | 5.956 | 0.807 | 1.101 | 0.474 | 0.593 | 0.183 | 0.089 | NA | NA | NA | NA | NA | NA |
| 27 | Cape Verde | NA | NA | NA | NA | NA | NA | NA | 7.88 | 9.17 | 7.86 | 6.67 | 6.88 | 8.82 |
| 33 | Comoros | NA | NA | NA | NA | NA | NA | NA | 3.71 | 4.33 | 2.21 | 4.44 | 3.75 | 3.82 |
| 34 | Congo (Brazzaville) | 4.559 | 0.682 | 0.811 | 0.343 | 0.514 | 0.091 | 0.077 | NA | NA | NA | NA | NA | NA |
| 35 | Congo (Kinshasa) | 4.245 | 0.069 | 1.136 | 0.204 | 0.312 | 0.197 | 0.052 | NA | NA | NA | NA | NA | NA |
| 38 | Cuba | NA | NA | NA | NA | NA | NA | NA | 3.00 | 1.08 | 3.57 | 3.33 | 4.38 | 2.65 |
| 41 | Democratic Republic of the Congo | NA | NA | NA | NA | NA | NA | NA | 1.49 | 0.50 | 0.71 | 2.22 | 3.13 | 0.88 |
| 43 | Djibouti | NA | NA | NA | NA | NA | NA | NA | 2.87 | 0.42 | 1.79 | 3.89 | 5.63 | 2.65 |
| 48 | Equatorial Guinea | NA | NA | NA | NA | NA | NA | NA | 1.92 | 0.00 | 0.43 | 3.33 | 4.38 | 1.47 |
| 49 | Eritrea | NA | NA | NA | NA | NA | NA | NA | 2.37 | 0.00 | 2.14 | 1.67 | 6.88 | 1.18 |
| 51 | Eswatini | NA | NA | NA | NA | NA | NA | NA | 3.03 | 0.92 | 2.86 | 2.22 | 5.63 | 3.53 |
| 53 | Fiji | NA | NA | NA | NA | NA | NA | NA | 5.85 | 6.58 | 5.36 | 6.11 | 5.63 | 5.59 |
| 57 | Gambia | NA | NA | NA | NA | NA | NA | NA | 4.31 | 4.48 | 4.29 | 3.33 | 5.63 | 3.82 |
| 64 | Guinea-Bissau | NA | NA | NA | NA | NA | NA | NA | 1.98 | 1.67 | 0.00 | 2.78 | 3.13 | 2.35 |
| 65 | Guyana | NA | NA | NA | NA | NA | NA | NA | 6.67 | 9.17 | 5.71 | 6.11 | 5.00 | 7.35 |
| 84 | Kosovo | 5.662 | 0.855 | 1.230 | 0.578 | 0.448 | 0.274 | 0.023 | NA | NA | NA | NA | NA | NA |
| 95 | Macedonia | 5.185 | 0.959 | 1.239 | 0.691 | 0.394 | 0.173 | 0.052 | NA | NA | NA | NA | NA | NA |
| 117 | North Korea | NA | NA | NA | NA | NA | NA | NA | 1.08 | 0.00 | 2.50 | 1.67 | 1.25 | 0.00 |
| 118 | North Macedonia | NA | NA | NA | NA | NA | NA | NA | 5.87 | 6.50 | 5.36 | 6.67 | 3.75 | 7.06 |
| 119 | Northern Cyprus | 5.835 | 1.229 | 1.211 | 0.909 | 0.495 | 0.179 | 0.154 | NA | NA | NA | NA | NA | NA |
| 121 | Oman | NA | NA | NA | NA | NA | NA | NA | 3.04 | 0.00 | 3.93 | 2.78 | 4.38 | 4.12 |
| 123 | Palestine | NA | NA | NA | NA | NA | NA | NA | 4.39 | 3.83 | 2.14 | 7.78 | 4.38 | 3.82 |
| 124 | Palestinian Territories | 4.743 | 0.642 | 1.217 | 0.602 | 0.266 | 0.086 | 0.076 | NA | NA | NA | NA | NA | NA |
| 126 | Papua New Guinea | NA | NA | NA | NA | NA | NA | NA | 6.03 | 6.92 | 6.07 | 3.89 | 5.63 | 7.65 |
| 133 | Republic of the Congo | NA | NA | NA | NA | NA | NA | NA | 3.31 | 3.17 | 2.50 | 3.89 | 3.75 | 3.24 |
| 144 | Somalia | 4.982 | 0.000 | 0.712 | 0.115 | 0.674 | 0.238 | 0.282 | NA | NA | NA | NA | NA | NA |
| 147 | South Sudan | 3.254 | 0.337 | 0.608 | 0.177 | 0.112 | 0.224 | 0.106 | NA | NA | NA | NA | NA | NA |
| 151 | Suriname | NA | NA | NA | NA | NA | NA | NA | 6.98 | 9.17 | 6.43 | 6.67 | 5.00 | 7.65 |
| 159 | Timor-Leste | NA | NA | NA | NA | NA | NA | NA | 7.19 | 9.08 | 6.79 | 5.56 | 6.88 | 7.65 |
| 161 | Trinidad & Tobago | 6.192 | 1.223 | 1.492 | 0.564 | 0.575 | 0.171 | 0.019 | NA | NA | NA | NA | NA | NA |
| 162 | Trinidad and Tobago | NA | NA | NA | NA | NA | NA | NA | 7.16 | 9.58 | 7.14 | 6.11 | 5.63 | 7.35 |
| 168 | United Arab Emirates | NA | NA | NA | NA | NA | NA | NA | 2.76 | 0.00 | 3.93 | 2.22 | 5.00 | 2.65 |
De lo anterior date cuenta que, por un lado, hay paises que les falta un bloque de indicadores, y que en muchos casos los nombres están mal escritos. Podemos recuperar algunos, pero en la data original:
# cambiemos a nombres usados por otra tabla:
## en demo por happy
demo[demo$Country=="Democratic Republic of the Congo",'Country']="Congo (Kinshasa)"
demo[demo$Country=="Republic of the Congo",'Country']="Congo (Brazzaville)"
demo[demo$Country=="Trinidad and Tobago",'Country']="Trinidad & Tobago"
demo[demo$Country=="North Macedonia",'Country']="Macedonia"
## en happy por demo
happy[happy$Country=="Palestinian Territories",'Country']="Palestine"
Luego de esos ajustes veamos:
HappyDemo=merge(happy,demo) # re creando HappyDemo
nrow(HappyDemo)
## [1] 150
En efecto se recuperaron 5 paises, asi quedará.
El análisis factorial requiere que hagamos algunas observaciones previas.
theData=HappyDemo[,-c(1,2,9)] # sin los Scores ni nombre de paÃs.
# esta es:
library(polycor)
corMatrix=polycor::hetcor(theData)$correlations
library(ggcorrplot)
ggcorrplot(corMatrix)
ggcorrplot(corMatrix,
p.mat = cor_pmat(corMatrix),
insig = "blank")
Si puedes ver bloques correlacionados, hay esperanza de un buen analisis factorial.
library(psych)
psych::KMO(corMatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix)
## Overall MSA = 0.87
## MSA for each item =
## GDP Social Healthy
## 0.82 0.92 0.87
## Freedom Generosity Perceptions
## 0.86 0.62 0.80
## ElecÂtoral FunctioÂning PoliÂticalparticiÂpation
## 0.81 0.92 0.95
## PoliÂticalculture CivilliberÂties
## 0.89 0.84
Aqui hay dos pruebas:
cortest.bartlett(corMatrix,n=nrow(theData))$p.value>0.05
## [1] FALSE
library(matrixcalc)
is.singular.matrix(corMatrix)
## [1] FALSE
fa.parallel(theData,fm = 'ML', fa = 'fa')
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
Se sugieren 3, veamos:
library(GPArotation)
resfa <- fa(theData,nfactors = 3,cor = 'mixed',rotate = "varimax",fm="minres")
##
## mixed.cor is deprecated, please use mixedCor.
print(resfa$loadings)
##
## Loadings:
## MR1 MR3 MR2
## GDP 0.268 0.927 0.126
## Social 0.321 0.663 0.152
## Healthy 0.345 0.803 0.153
## Freedom 0.255 0.284 0.516
## Generosity 0.612
## Perceptions 0.266 0.683
## ElecÂtoral 0.935 0.179
## FunctioÂning 0.753 0.429 0.356
## PoliÂticalparticiÂpation 0.696 0.341 0.148
## PoliÂticalculture 0.520 0.304 0.507
## CivilliberÂties 0.919 0.317
##
## MR1 MR3 MR2
## SS loadings 3.404 2.629 1.582
## Proportion Var 0.309 0.239 0.144
## Cumulative Var 0.309 0.548 0.692
print(resfa$loadings,cutoff = 0.51)
##
## Loadings:
## MR1 MR3 MR2
## GDP 0.927
## Social 0.663
## Healthy 0.803
## Freedom 0.516
## Generosity 0.612
## Perceptions 0.683
## ElecÂtoral 0.935
## FunctioÂning 0.753
## PoliÂticalparticiÂpation 0.696
## PoliÂticalculture 0.520
## CivilliberÂties 0.919
##
## MR1 MR3 MR2
## SS loadings 3.404 2.629 1.582
## Proportion Var 0.309 0.239 0.144
## Cumulative Var 0.309 0.548 0.692
Cuando logramos que cada variable se vaya a un factor, tenemos una estructura simple.
fa.diagram(resfa)
resfa$crms
## [1] 0.03691667
resfa$RMSEA
## RMSEA lower upper confidence
## 0.07968360 0.04056531 0.10959050 0.90000000
resfa$TLI
## [1] 0.9614123
¿Qué nombres les darÃas?
as.data.frame(resfa$scores)
## MR1 MR3 MR2
## 1 -0.44581478 -1.368481159 -1.098638556
## 2 0.48033887 0.090566181 -0.856703304
## 3 -1.00967021 0.585592494 -0.898629955
## 4 -1.01725350 -0.201269389 -0.687036544
## 5 0.73624548 0.463261056 -0.807836146
## 6 0.01584794 -0.013422423 -1.259526261
## 7 1.17998664 0.708063729 1.622400763
## 8 0.82303951 0.935375820 0.642938605
## 9 -1.50616013 0.826948485 -0.466857259
## 10 -2.01760261 1.621600785 0.247169356
## 11 0.21131821 -1.001237965 0.199771946
## 12 -1.66996548 0.917390607 0.039008363
## 13 0.79578389 0.823135593 0.752042977
## 14 0.45082762 -1.668920643 0.253673440
## 15 -0.05641176 -0.551594725 0.972541028
## 16 0.50412428 -0.397523850 -0.646284689
## 17 0.08318741 0.219130546 -1.327515737
## 18 1.06057976 -0.053434870 -0.211901993
## 19 0.92310546 0.142136942 -0.826976916
## 20 0.86850458 0.416442028 -1.244491103
## 21 0.06960676 -1.591645737 0.393753312
## 22 -1.01989359 -1.843388059 -0.236504917
## 23 -1.06874704 -0.634535152 1.193457361
## 24 -0.96478982 -0.850195652 -0.051683999
## 25 1.16412448 0.699895023 1.674473964
## 26 -0.82185506 -2.292555222 -0.715561120
## 27 -1.21459683 -1.193530366 -0.601933540
## 28 1.12549099 0.310091463 0.169110288
## 29 -2.02015893 0.930753211 0.730951312
## 30 0.92382717 0.031278951 -0.555349193
## 31 -0.99341039 -0.342515557 -0.440808189
## 32 -1.34392495 -1.731710198 -0.056671606
## 33 1.11687607 0.157292047 0.196185885
## 34 0.53162832 0.480393287 -0.921160352
## 35 0.85011409 0.764563718 -0.554786856
## 36 0.79624528 0.870222852 -0.688471459
## 37 0.94143175 0.714511451 2.121511960
## 38 0.54715397 0.139014157 -0.484753176
## 39 0.50703962 -0.003093982 -0.455381011
## 40 -1.06527183 0.235186366 -0.574312978
## 41 0.68842805 -0.254455499 -1.153543849
## 42 0.89257410 0.571923939 0.195893974
## 43 -1.11306931 -1.184726903 1.091758194
## 44 1.10985513 0.659666822 1.653651197
## 45 0.81540073 0.946585860 -0.223836118
## 46 -1.16044066 0.678685564 -0.957969601
## 47 0.07691544 -0.182138682 -0.664590958
## 48 1.00677308 0.796534944 1.072782633
## 49 0.57933687 -1.062947130 0.102809649
## 50 0.94020552 0.630579743 -1.527306759
## 51 0.39726736 -0.347429130 -0.255826679
## 52 -0.78707349 -1.305764808 -0.336231492
## 53 0.46456787 -1.864571231 0.146446067
## 54 0.55857179 -0.779649140 -0.409560785
## 55 -0.17346845 1.441868311 1.010220747
## 56 0.47287912 0.678256215 -1.060140370
## 57 1.10939209 0.852767876 1.372869501
## 58 0.88601738 -0.785000631 0.101840109
## 59 0.11795711 -0.206917896 0.927890206
## 60 -2.04168760 0.889625721 0.361186027
## 61 -0.95035625 0.548246073 -1.155387122
## 62 0.98365854 1.030709219 1.423727390
## 63 0.11396222 0.866304691 1.406808120
## 64 0.79327364 0.975072334 -0.956233635
## 65 -0.44161397 -1.077594626 0.006215688
## 66 1.04591371 -0.277244425 -0.267832175
## 67 0.76222476 1.001900825 0.153818849
## 68 -0.87467550 0.112047320 0.109949230
## 69 -1.65050509 1.197143904 -0.274185318
## 70 -0.33522854 -0.966870000 1.027485617
## 71 -1.49516709 1.803872673 -0.233903317
## 72 0.04293901 -0.757298467 -0.217984352
## 73 -1.78599653 -0.152032970 0.993701862
## 74 0.97925834 0.456397863 -0.826556978
## 75 -0.77804159 0.538112456 -0.078162481
## 76 0.94805064 -1.522705964 -0.195547331
## 77 0.63831210 -2.262436649 -0.239690918
## 78 -1.68409623 0.712425727 -0.388757551
## 79 1.04840427 0.693005613 -1.315279845
## 80 0.88865588 1.352624510 1.111733995
## 81 0.34657638 0.238754003 -0.733574453
## 82 0.23975441 -1.637992753 -0.252933264
## 83 0.50924510 -2.077682623 0.466171447
## 84 0.15039097 0.505948071 0.394348464
## 85 0.63440877 -1.617044211 -0.370645384
## 86 0.80740644 0.684542762 1.164305786
## 87 -0.34912495 -0.757095434 -0.595649455
## 88 1.11202349 0.179626701 0.506248330
## 89 0.27914706 0.459796315 -0.816176737
## 90 0.69514799 -0.572741827 -0.979962610
## 91 0.71917670 -0.125566004 -0.508518883
## 92 0.21186259 0.422573984 -0.813437625
## 93 -0.45635583 -0.087937715 -0.294497203
## 94 -0.48561718 -1.783217143 0.597131853
## 95 -1.14335379 -0.633287427 1.952906573
## 96 0.45215981 -0.099779520 -0.482248625
## 97 0.08717409 -1.151560004 0.852537682
## 98 0.92601217 0.806327162 1.601247115
## 99 1.26762633 0.471511002 2.002009591
## 100 -0.86360371 -0.173889833 0.193001205
## 101 0.06111228 -2.017396366 -0.389845399
## 102 -0.19638111 -0.770587015 -0.303260964
## 103 1.02148715 1.006708243 1.940850302
## 104 -0.03154804 -0.807150396 -0.230102465
## 105 -0.61081426 -0.232953321 -0.754153058
## 106 0.73062487 0.520443632 -0.681617305
## 107 0.73511658 -0.235024164 -0.469515639
## 108 0.70482067 0.052501923 -0.874432123
## 109 0.72297250 -0.435483094 -0.288170577
## 110 0.58322251 0.726880192 -0.870804685
## 111 0.99402547 0.733652385 -0.644937904
## 112 -2.01905922 2.325144966 0.630535581
## 113 0.60145626 0.544545448 -1.211233623
## 114 -1.42279012 1.248943430 -1.411789904
## 115 -0.90602330 -1.468904392 2.288749138
## 116 -2.34549497 1.837085556 -0.350236398
## 117 0.72638640 -1.398205292 0.210006801
## 118 0.63755954 0.175566272 -0.960254960
## 119 0.29100869 -2.006041810 -0.228120837
## 120 -0.34592659 1.610737584 1.856898063
## 121 0.77684080 0.735887414 -0.978620151
## 122 0.71862466 0.806359213 -0.351334507
## 123 0.74986983 -0.082771999 -0.229027439
## 124 0.81387127 0.840110174 -0.338026089
## 125 0.86650955 0.917892294 -0.285121513
## 126 0.22068499 -0.020427117 0.300412916
## 127 -1.62467476 -0.371226704 -0.280898781
## 128 0.96997873 0.743436655 2.178871590
## 129 0.83345428 0.976979664 1.818733107
## 130 -2.13333912 -0.226400585 0.188606775
## 131 0.97446367 1.032129797 -0.543468218
## 132 -1.78808895 -0.477977648 0.094151298
## 133 0.07830330 -1.310118791 0.599597342
## 134 -0.56502405 0.585696732 0.289474776
## 135 -0.72763414 -1.559224420 -0.121188819
## 136 0.55815746 0.651184758 -0.440521935
## 137 0.09861209 0.105761486 -0.500902501
## 138 -1.22517536 0.961002640 -0.129505220
## 139 -2.31866025 0.943419620 -0.265098142
## 140 0.27386398 -1.633603584 0.420051497
## 141 0.20270976 -0.139502857 -0.874216960
## 142 1.00880031 0.718322029 0.184231055
## 143 0.57524211 1.092845257 0.369168539
## 144 1.29100183 0.219300594 0.406160014
## 145 -2.06860237 0.048333426 1.589742380
## 146 -1.20739186 0.817518450 -1.210069407
## 147 -1.55297428 0.174837293 0.656505739
## 148 -1.75630380 -0.618763549 -0.421561741
## 149 0.34238324 -1.098811528 0.358161254
## 150 -1.01759077 -1.082303889 0.303390169
HappyDemoFA=cbind(HappyDemo[1],as.data.frame(resfa$scores))
library(plotly)
plot_ly(data=HappyDemoFA, x = ~MR1, y = ~MR2, z = ~MR3, text=~Country) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Demo'),
yaxis = list(title = 'Tranquilidad'),
zaxis = list(title = 'Bienestar')))
RECORDANDO:
library(fpc)
library(cluster)
library(dbscan)
# YA NO NECESITAS CMD para HappyDemoFA[,c(2:4)]
g.dist.cmd = daisy(HappyDemoFA[,c(2:4)], metric = 'euclidean')
kNNdistplot(g.dist.cmd, k=3)
Para tener una idea de cada quien:
resDB=fpc::dbscan(g.dist.cmd, eps=0.6, MinPts=3,method = 'dist')
HappyDemoFA$clustDB=as.factor(resDB$cluster)
aggregate(cbind(MR1, MR2,MR3) # dependientes
~ clustDB, # nivel
data = HappyDemoFA, # data
max) # operacion
## clustDB MR1 MR2 MR3
## 1 0 0.2206850 2.2887491 2.3251450
## 2 1 1.2910018 2.1788716 1.3526245
## 3 2 -0.4563558 -0.2944972 1.2489434
## 4 3 -1.2251754 0.7309513 1.1971439
## 5 4 -0.5650241 0.2894748 0.5856967
## 6 5 -1.6246748 0.1886068 -0.2264006
plot_ly(data=HappyDemoFA, x = ~MR1, y = ~MR2, z = ~MR3, text=~Country, color = ~clustDB) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Demo'),
yaxis = list(title = 'Tranquilidad'),
zaxis = list(title = 'Bienestar')))